Abstract. Compressed full-text indexes have been one of pattern matching's most important success
stories of the past decade. We can now store a text in nearly the information-theoretic minimum of 
space, such that we can still quickly count and locate occurrences of any given pattern. However, some 
files or collections of files are so huge that, even compressed, they do not all fit in one machine's internal 
memory. One solution is to break the file or collection into pieces and create a distributed index spread 
across many machines (e.g., a cluster, grid or cloud). Suppose we want to search such an index for many 
patterns. Since each pattern is to be sought on each machine, it is worth spending a reasonable amount 
of time to preprocess the patterns if that leads to faster searches. In this paper we show that if the
concatenation of the patterns can be compressed well with LZ77, then we can take advantage of their
similarities to speed up searches in BWT-based indexes. More specifically, if we are searching for t patterns
of total length m in a distributed index for a text of total length n, then we spend O(m + (g + t) log m)
time preprocessing the patterns on one machine and then O((g + t) log^2 m log^{1+ε} n) time searching for
them on each machine, where g is the size of the smallest straight-line program for the concatenation of 
the patterns. Thus, if the concatenation of the patterns has a small straight-line program — plausible 
if the patterns are similar — and the number of machines is large, we achieve a theoretically significant 
speed-up. The techniques we use seem likely to be of independent interest and we show how they can 
be applied to pattern matching with wildcards and parallel pattern matching. 



1 Introduction 



Compressed full-text indexes have revolutionized some areas of pattern matching, offering both 
nearly optimal compression and fast searching simultaneously, but other areas have yet to benefit 
from them. For example, when Navarro and Mäkinen [10] wrote their survey of such indexes, most
of the literature on them dealt with exact, single-pattern matching with one processor, with a 
few notable papers dealing with approximate matching. Since then, research on those topics has 
continued and research has begun on, e.g., matching with wildcards [8], parallelized searching and 
distributed indexes [12]. As far as we know, however, there has been no previous work on designing 
indexes for multi-pattern matching in the sense of, say, the Aho-Corasick algorithm [2]. That is, 
although indexing really makes sense only when searching for multiple patterns — if we are to 
search only for one, then it is faster to do so directly than to first build an index — the standard 
approach is to search for the patterns separately (see, e.g., [3, pg. 1]) without taking advantage of 
possible similarities between them. In this paper we show that, if the concatenation of the patterns 
can be compressed well with LZ77 [16], then we can take advantage of their similarities to speed 
up searches in indexes based on the Burrows-Wheeler Transform (BWT) [5]. Since running LZ
can be as time-consuming as searching in the index directly, we consider the case when we are
searching for many patterns in an index for a file so large that, even compressed, it does not all fit
in one machine's memory and must be stored as a distributed index spread across many machines
(e.g., a cluster, grid or cloud). Since each pattern is to be sought on each machine, it is worth
spending a reasonable amount of time to preprocess the patterns if that leads to faster searches.
More specifically, if we are searching for t patterns of total length m in a distributed index for
a text of total length n, then we spend O(m + (g + t) log m) time preprocessing the patterns on
one machine and then O((g + t) log^2 m log^{1+ε} n) time searching for them on each machine, where
g is the size of the smallest straight-line program for the concatenation of the patterns. Thus, if
the concatenation of the patterns has a small straight-line program (plausible if the patterns
are similar) and the number of machines is large, we achieve a theoretically significant speed-up.
Such a speed-up could be of practical importance in several bioinformatics applications, in which
both very large files and multi-pattern matching are common [9, 11].

Fig. 1. The BWT of mississippi and the intervals [2,5], [7,8] and [3] corresponding to i, p and ip, respectively.

The BWT sorts the characters in a text T (possibly with a special end-of-string character $ 
appended) into the lexicographic order of the suffixes immediately following them. As a result, 
any pattern has a corresponding interval in the BWT, containing the character immediately before 
each occurrence of that pattern. For example, if T = mississippi, then bwt(T) = ipssm$pissii; the 
intervals corresponding to patterns i, p and ip are [2, 5], [7, 8] and [3], respectively. This is illustrated 
in Figure 1. 
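The example above can be checked with a short Python sketch that builds the BWT by sorting suffixes explicitly. This is only an illustration, not an index: the names bwt_rows and interval are ours, rows and positions are 1-indexed as in the text, and we assume the text contains only characters that sort after $.

```python
def bwt_rows(text):
    """Return (bwt, starts); starts[r-1] is the 1-indexed start in text + '$'
    of the r-th smallest suffix."""
    s = text + "$"                       # '$' sorts before lowercase letters
    starts = sorted(range(1, len(s) + 1), key=lambda p: s[p - 1:])
    # The BWT character of a row is the one just before its suffix (cyclically).
    bwt = "".join(s[p - 2] for p in starts)
    return bwt, starts


def interval(text, pattern):
    """Return the 1-indexed (first, last) rows whose suffixes start with
    pattern, or None if the pattern does not occur."""
    s = text + "$"
    _, starts = bwt_rows(text)
    rows = [r for r, p in enumerate(starts, start=1)
            if s.startswith(pattern, p - 1)]
    return (rows[0], rows[-1]) if rows else None


if __name__ == "__main__":
    print(bwt_rows("mississippi")[0])     # ipssm$pissii
    print(interval("mississippi", "i"))   # (2, 5)
    print(interval("mississippi", "ip"))  # (3, 3)
```

A real BWT-based index answers the same interval query without materializing the sorted suffixes; the sketch only fixes the conventions used in the later examples.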

BWT-based indexes are among the most competitive compressed full-text indexes known. They 
can store a text in nearly the information-theoretic minimum of space while still allowing us to 
quickly count and locate occurrences of any given pattern. For more information on the BWT 
and on such indexes in general, we refer the reader to the recent book by Adjeroh, Bell and 
Mukherjee [1]. Barbay, Gagie, Navarro and Nekrich [4] very recently gave one such index that stores 
a text T[1..n] over an alphabet of size σ in nH_k(T) + o(n)(H_k(T) + 1) bits, for all k < (1 - ε) log_σ n - 1
simultaneously, such that:

- given a pattern P[1..m], in O(m log log σ) time we can find the endpoints of the interval in
  bwt(T) containing the character immediately before each occurrence of P in T (and, thus,
  count the number of occurrences by taking their difference and adding 1);

- given a character T[i]'s position in bwt(T), in O(log^{1+ε} n) time we can find i.

The second query is called locate and will, together with the inverse of it that we define in Section 2, 
be useful to us not just as a query to be implemented for its own sake, as is usual, but also as a 
primitive to implement other queries. 

Our idea is to take advantage of long repeated sub-patterns: after all, if we have already spent the 
time searching for a sub-pattern, then we would like to avoid searching for it again. To quantify our 
advantage, our analyses are in terms of the number of phrases in the LZ77 parse of the concatenation 
of the patterns, and the size of the smallest straight-line program that generates the concatenation. 
To find common sub-patterns, we compute the LZ77 parse of the concatenation of the patterns. 
We consider the version of LZ77 that requires the match to be completely contained in the prefix 
already parsed. The parse is defined in terms of a greedy algorithm: if we are parsing a pattern 
P[1..m] and have already processed P[1..i], then we look for the longest prefix of P[i + 1..m] that
we have already seen; we record the position and length of the matching sub-pattern, or P[i + 1] if
no such match exists. Rytter [14] showed that the number of phrases in this LZ77 parse for a string is a
lower bound on the size of the smallest context-free grammar in Chomsky normal form (or straight- 
line program) that generates that string and only that string. He also showed how to convert the 
LZ77 parse into a straight-line program with a logarithmic blow-up. 
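The greedy parse described above can be sketched in a few lines of Python. This is a deliberately naive quadratic-time illustration (the linear-time algorithms cited later are far more involved); the function names and the phrase encoding, a single character or a 1-indexed (position, length) pair, are our own choices.

```python
def lz77_parse(p):
    """Greedy LZ77 parse in which every match lies entirely inside the
    prefix that has already been parsed (no self-overlap)."""
    phrases = []
    i = 0                                   # characters parsed so far
    while i < len(p):
        best_len, best_pos = 0, -1
        for start in range(i):
            length = 0
            while (start + length < i and i + length < len(p)
                   and p[start + length] == p[i + length]):
                length += 1
            if length > best_len:
                best_len, best_pos = length, start
        if best_len == 0:
            phrases.append(p[i])            # character not seen before
            i += 1
        else:
            phrases.append((best_pos + 1, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Rebuild the string from the phrases produced by lz77_parse."""
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            pos, length = phrase
            out.extend(out[pos - 1:pos - 1 + length])
    return "".join(out)


if __name__ == "__main__":
    print(lz77_parse("abaababaabaab"))
    # ['a', 'b', (1, 1), (1, 3), (2, 5), (1, 2)]
```

The six phrases for abaababaabaab illustrate how repeated sub-patterns collapse into references to earlier text.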

In Section 2 we define the anti-locate query, show how it can be implemented by adding o(n) 
bits to Barbay et al.'s index and, as a warm-up, show how it is useful in pattern matching with 
wildcards. In Section 3 we show how we can use locate and anti-locate to find the interval in bwt(T) 
for the concatenation of two sub-patterns in polylogarithmic time, assuming we already know the 
intervals for those sub-patterns. This means that, if we have a straight-line program for a pattern, 
then we can find the interval for that pattern using polylogarithmic time per distinct non-terminal 
in the program. From this we obtain our speed-up for multi-pattern matching in a distributed 
index. In Section 4 we show how we can parallelize the searching, obtaining a speed-up linear in the 
number of processors. In Section 5 we discuss some other possible applications, and we summarize 
our results in Section 6. 

2 Anti-Locate and Pattern Matching with Wildcards 

Like many other BWT-based indexes, Barbay et al.'s [4] uses an o(n)-bit sample to support locate,
which takes the position of a character in bwt(T) and returns that character's position in T. 
We can store a similar sample for the inverse query: anti-locate takes the position of a character 
in T and returns that character's position in bwt(T). We store the position in bwt(T) of every 
(log n log log n)th character of T; given i, in O(1) time we find the character whose position in
bwt(T) we have stored and that is closest to T[i] in T; then from that we use rank and select
queries to find T[i]'s position in bwt(T) in O(log n log log n log log σ) ⊂ O(log^{1+ε} n) time.

Lemma 1. We can add o(n) bits to Barbay et al.'s index such that it supports anti-locate in
O(log^{1+ε} n) time.
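The sampling idea behind Lemma 1 can be illustrated as follows. This is a sketch under our own simplifying assumptions, not Barbay et al.'s data structure: the BWT is a plain Python string, rank is computed by counting, the sampling rate step is a free parameter rather than log n log log n, and we always walk backwards from the nearest sampled position at or after i using the standard LF mapping.

```python
def build_index(text, step):
    """Naively build bwt(T$) and store the row of every step-th character
    of T (plus the last one); text must be non-empty."""
    s = text + "$"
    n = len(text)
    starts = sorted(range(1, n + 2), key=lambda p: s[p - 1:])
    bwt = "".join(s[p - 2] for p in starts)
    row_of = {p: r for r, p in enumerate(starts, start=1)}
    # T[j] sits in the row of the suffix that starts at position j + 1.
    sampled = set(range(step, n + 1, step)) | {n}
    samples = {j: row_of[j + 1] for j in sampled}
    return bwt, samples, n


def lf(bwt, r):
    """Map the row holding T[k] to the row holding T[k-1] (1-indexed)."""
    c = bwt[r - 1]
    smaller = sum(1 for x in bwt if x < c)  # characters of T$ smaller than c
    return smaller + bwt[:r].count(c)       # plus the rank of c up to row r


def anti_locate(bwt, samples, n, step, i):
    """Return the row of bwt(T) that holds T[i], for 1 <= i <= n."""
    j = min(-(-i // step) * step, n)        # nearest sampled position >= i
    r = samples[j]
    for _ in range(j - i):                  # fewer than step LF steps
        r = lf(bwt, r)
    return r


if __name__ == "__main__":
    bwt, samples, n = build_index("mississippi", 3)
    print(anti_locate(bwt, samples, n, 3, 9))   # 7
    print(anti_locate(bwt, samples, n, 3, 5))   # 11
```

Larger values of step store fewer samples but require more LF steps per query, which is the space/time trade-off the lemma balances.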

To give a simple illustration of how anti-locate can be useful, in the rest of this section we 
apply it to pattern matching with wildcards. Lam, Sung, Tam and Yiu [8] gave an O(n log n)-bit
index for a text T[1..n] such that, given a pattern P[1..m] = P_1 ?^{w_1} P_2 ?^{w_2} ... P_t containing a total
of w occurrences of the wildcard symbol ? and t maximal sub-patterns P_1, ..., P_t containing no
wildcards, we can find all substrings of T matching P in O(m + t min_h {occ(P_h)}) time when each
wildcard must be replaced by a character; when wildcards can be replaced or ignored, we use
O(m + wt min_h {occ(P_h)}) time. (Notice we can ignore wildcards at the beginning or end of P.)
We give the first compressed index for pattern matching with wildcards, by showing how we can
search our index from Lemma 1 in O(m log log σ + t min_h {occ(P_h)} log^{1+ε} n) time when wildcards
must be replaced, or O(m log log σ + wt min_h {occ(P_h)} log^{1+ε} n) time when they can be replaced
or ignored.

First assume that each wildcard must be replaced by a character. We first find the intervals in
bwt(T) corresponding to each of P_1, ..., P_t, which takes O(m log log σ) time. We choose the shortest
such interval, which has length min_h {occ(P_h)}; suppose it is for sub-pattern P_j. If j < t, we check
each position i in the interval for P_j to see whether

anti-locate(locate(i) + |P_j| + w_j)

is in the interval for P_{j+1}; if so, then we have found the starting position in bwt(T) of a substring
in T matching P_j ?^{w_j} P_{j+1}. If j = t, we check each position i in the interval for P_j to see whether

anti-locate(locate(i) - w_{j-1} - |P_{j-1}|)

is in the interval for P_{j-1}; if so, then we have found the starting position in bwt(T) of a substring in
T matching P_{j-1} ?^{w_{j-1}} P_j. In either case, this takes O(occ(P_j) log^{1+ε} n) time and yields the positions
in bwt(T) for at most occ(P_j) matching substrings. For example, suppose we want to match s??s
in T = mississippi. The interval for s is [9, 12] and 

anti-locate(locate(9) + |s| + 2) = anti-locate(9) = 7
anti-locate(locate(10) + |s| + 2) = anti-locate(6) = 9
anti-locate(locate(11) + |s| + 2) = anti-locate(8) = 8
anti-locate(locate(12) + |s| + 2) = anti-locate(5) = 11,

so the intervals in the intersection, [9] and [11], correspond to the two substrings, siss and ssis, that 
match s??s. 
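The s??s example can be reproduced with the following Python sketch for a pattern with a single run of wildcards, P_1 ?^w P_2. It is an illustration only: locate, anti-locate and the interval search are simulated with an explicit sorted suffix list (the helper names are ours), and for brevity the loop always scans the interval of P_1 rather than the shorter of the two intervals.

```python
def make_queries(text):
    """Naive stand-ins for the index queries on text + '$' (1-indexed)."""
    s = text + "$"
    starts = sorted(range(1, len(s) + 1), key=lambda p: s[p - 1:])
    row_of = {p: r for r, p in enumerate(starts, start=1)}

    def locate(r):              # position in T of the character in row r
        return starts[r - 1] - 1

    def anti_locate(i):         # row of bwt(T) that holds T[i]
        return row_of[i + 1]

    def interval(pattern):      # rows whose suffixes start with pattern
        rows = [r for r, p in enumerate(starts, start=1)
                if s.startswith(pattern, p - 1)]
        return (rows[0], rows[-1]) if rows else None

    return locate, anti_locate, interval


def match_gap(text, p1, w, p2):
    """Return the sorted 1-indexed starting positions in text of the
    substrings matching p1, then w arbitrary characters, then p2."""
    locate, anti_locate, interval = make_queries(text)
    iv1, iv2 = interval(p1), interval(p2)
    if iv1 is None or iv2 is None:
        return []
    n = len(text)
    found = []
    for r in range(iv1[0], iv1[1] + 1):
        i = locate(r) + len(p1) + w     # character just before a candidate p2
        if i <= n and iv2[0] <= anti_locate(i) <= iv2[1]:
            found.append(locate(r) + 1)
    return sorted(found)


if __name__ == "__main__":
    print(match_gap("mississippi", "s", 2, "s"))   # [3, 4]
```

Positions 3 and 4 are the starts of ssis and siss, the two matches found in the text above.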

Because of the wildcards, the positions we have found may not be consecutive. Nevertheless, we
can repeat the procedure above for each of them, to append or prepend sequences of wildcards and
sub-patterns, again using a total of O(occ(P_j) log^{1+ε} n) time for each sub-pattern. It follows that
we can find all the substrings of T matching P in O(m log log σ + t min_h {occ(P_h)} log^{1+ε} n) time.

Now assume wildcards can be replaced or ignored. We proceed much as before but whenever we
would check anti-locate(locate(i) + |P_j| + w_j) or anti-locate(locate(i) - w_j - |P_j|) for some i and j,
we now check anti-locate(locate(i) + |P_j| + w'_j) or anti-locate(locate(i) - w'_j - |P_j|) for 0 ≤ w'_j ≤ w_j.
Calculation shows that the whole procedure now takes O(m log log σ + wt min_h {occ(P_h)} log^{1+ε} n)
time.

Theorem 1. We can build an (nH_k(T) + o(n(H_k(T) + 1)))-bit index for a text T[1..n] such that,
given a pattern P[1..m] containing w wildcards and t maximal sub-patterns P_1, ..., P_t containing no
wildcards, we can find all substrings of T matching P in O(m log log σ + t min_h {occ(P_h)} log^{1+ε} n)
time when wildcards must be replaced, or O(m log log σ + wt min_h {occ(P_h)} log^{1+ε} n) time when
they can be replaced or ignored.

3 Concatenating Sub-Patterns 

We now describe the key observation behind our main result: how we can use anti-locate when we
have found the intervals in bwt(T) corresponding to sub-patterns P_1 and P_2 and now want to find
the interval corresponding to P_1P_2. Let i_{P_1} and j_{P_1} be the endpoints of the interval for P_1, let i_{P_2}
and j_{P_2} be the endpoints of the interval for P_2, and let i_{P_1P_2} and j_{P_1P_2} be the endpoints of the
interval for P_1P_2, as shown in Figure 2.

Notice that, since every occurrence of P_1P_2 in T is also an occurrence of P_1, [i_{P_1P_2}, j_{P_1P_2}] is a
subinterval of [i_{P_1}, j_{P_1}]. Also, if i is in [i_{P_1}, j_{P_1}] but anti-locate(locate(i) + |P_1|) is strictly before
i_{P_2} then, by the definition of the Burrows-Wheeler Transform, T[locate(i) + |P_1| + 1..locate(i) +
|P_1| + |P_2|] is lexicographically strictly less than P_2, so T[locate(i) + 1..locate(i) + |P_1| + |P_2|] is
lexicographically strictly less than P_1P_2 and i is strictly before i_{P_1P_2}. (Recall that the characters in
the interval [i_P, j_P] for a pattern P occur in T immediately before occurrences of P; therefore, if i
is in [i_P, j_P], then the corresponding occurrence of P is T[locate(i) + 1..locate(i) + |P|], rather than
T[locate(i)..locate(i) + |P| - 1].) If anti-locate(locate(i) + |P_1|) is in [i_{P_2}, j_{P_2}], then T[locate(i) +
|P_1| + 1..locate(i) + |P_1| + |P_2|] is an occurrence of P_2, so T[locate(i) + 1..locate(i) + |P_1| + |P_2|]
is an occurrence of P_1P_2 and i is in [i_{P_1P_2}, j_{P_1P_2}]. Finally, if anti-locate(locate(i) + |P_1|) is strictly
after j_{P_2}, then T[locate(i) + |P_1| + 1..locate(i) + |P_1| + |P_2|] is lexicographically strictly greater than
P_2, so T[locate(i) + 1..locate(i) + |P_1| + |P_2|] is lexicographically strictly greater than P_1P_2 and i is
strictly after j_{P_1P_2}. Figure 2 illustrates these three cases. It follows that, by using binary search in
[i_{P_1}, j_{P_1}], we can find i_{P_1P_2} and j_{P_1P_2} in O(log(j_{P_1} - i_{P_1}) log^{1+ε} n) = O(log occ(P_1) log^{1+ε} n) time.
For example, suppose T = mississippi and we have found the intervals [2,5] and [7,8] for i and p, 
respectively (shown in Figure 1), and now want to find the interval for ip. A binary search through 
the values 

anti-locate(locate(2) + |i|) = anti-locate(11) = 1
anti-locate(locate(3) + |i|) = anti-locate(8) = 8
anti-locate(locate(4) + |i|) = anti-locate(5) = 11
anti-locate(locate(5) + |i|) = anti-locate(2) = 12



shows that only 8 is in [7, 8], so the interval for ip is [3]. We show all four of the values above for 
the sake of exposition, even though a binary search requires us to evaluate only some of them. 
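The binary search just described can be sketched in Python as follows. As before, this is an illustration under our own assumptions: locate and anti-locate are simulated with an explicit sorted suffix list, and the names make_queries and concat_interval are hypothetical helpers rather than part of any index.

```python
def make_queries(text):
    """Naive stand-ins for the index queries on text + '$' (1-indexed)."""
    s = text + "$"
    starts = sorted(range(1, len(s) + 1), key=lambda p: s[p - 1:])
    row_of = {p: r for r, p in enumerate(starts, start=1)}

    def locate(r):              # position in T of the character in row r
        return starts[r - 1] - 1

    def anti_locate(i):         # row of bwt(T) that holds T[i]
        return row_of[i + 1]

    def interval(pattern):      # rows whose suffixes start with pattern
        rows = [r for r, p in enumerate(starts, start=1)
                if s.startswith(pattern, p - 1)]
        return (rows[0], rows[-1]) if rows else None

    return locate, anti_locate, interval


def concat_interval(locate, anti_locate, iv1, len1, iv2):
    """Given the intervals iv1 and iv2 of P1 and P2 and len1 = |P1|,
    return the interval of P1P2, or None if P1P2 does not occur."""
    def key(r):                 # increases with r over the rows of iv1
        return anti_locate(locate(r) + len1)

    lo, hi = iv1[0], iv1[1] + 1
    while lo < hi:              # first row whose key is >= iv2[0]
        mid = (lo + hi) // 2
        if key(mid) < iv2[0]:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    lo, hi = first, iv1[1] + 1
    while lo < hi:              # first row whose key is > iv2[1]
        mid = (lo + hi) // 2
        if key(mid) <= iv2[1]:
            lo = mid + 1
        else:
            hi = mid
    last = lo - 1
    return (first, last) if first <= last else None


if __name__ == "__main__":
    locate, anti_locate, interval = make_queries("mississippi")
    print(concat_interval(locate, anti_locate, (2, 5), 1, (7, 8)))  # (3, 3)
```

Each probe costs one locate and one anti-locate query, so the two searches together use O(log occ(P_1)) such queries.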

We can improve this by sampling the value anti-locate(locate(i) + ℓ) for every (ℓ log^{2+ε} n)th
position i in bwt(T), for 1 ≤ ℓ ≤ n, which takes O(Σ_ℓ (n log n)/(ℓ log^{2+ε} n)) = o(n) bits. We can now find i_{P_1P_2} and
j_{P_1P_2} as follows: we first use binary search in the values sampled for ℓ = |P_1| in the interval [i_{P_1}, j_{P_1}]
of bwt(T) to find subintervals of length at most |P_1| log^{2+ε} n that contain i_{P_1P_2} and j_{P_1P_2}, which
takes O(log n) time since we do not need to perform locate and anti-locate queries here; we then use
binary search in those subintervals to find i_{P_1P_2} and j_{P_1P_2}, which takes O(log |P_1| log^{1+ε} n log log n)
time. The log log n factor can be hidden within the log^{1+ε} n factor.
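The space bound for these samples can be checked with a short calculation. The following is a sketch under the assumption (ours) that each sampled value is stored in O(log n) bits; it uses the harmonic-sum bound \sum_{\ell=1}^{n} 1/\ell = O(\log n).

```latex
\sum_{\ell=1}^{n} \frac{n}{\ell \log^{2+\epsilon} n} \cdot O(\log n)
  = O\!\left( \frac{n \log n}{\log^{2+\epsilon} n} \sum_{\ell=1}^{n} \frac{1}{\ell} \right)
  = O\!\left( \frac{n \log^{2} n}{\log^{2+\epsilon} n} \right)
  = O\!\left( \frac{n}{\log^{\epsilon} n} \right)
  = o(n).
```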
Fig. 2. The BWT of T with the intervals [i_{P_1}, j_{P_1}] and [i_{P_2}, j_{P_2}] corresponding to P_1 and P_2 shown in
grey and the interval [i_{P_1P_2}, j_{P_1P_2}] corresponding to P_1P_2 shown in black. The three cases shown are:

- if i_{P_1} ≤ i < i_{P_1P_2} then anti-locate(locate(i) + |P_1|) < i_{P_2} (left arrow);

- if i_{P_1P_2} ≤ i ≤ j_{P_1P_2} then i_{P_2} ≤ anti-locate(locate(i) + |P_1|) ≤ j_{P_2} (center arrow);

- if j_{P_1P_2} < i ≤ j_{P_1} then anti-locate(locate(i) + |P_1|) > j_{P_2} (right arrow).



Lemma 2. We can add o(n) bits to Barbay et al.'s index such that, once we have found the
intervals in bwt(T) corresponding to P_1 and P_2, we can find the interval corresponding to P_1P_2 in
O(log |P_1| log^{1+ε} n) time.



Let P[1..m] be a pattern. Notice that, if we have a context-free grammar in Chomsky normal
form that generates P and only P, also known as a straight-line program (SLP) for P, then we
can find the interval in bwt(T) corresponding to P by applying Lemma 2 once for each distinct
non-terminal X: assuming we have already found the intervals for the expansions of the symbols on
the right-hand side of the unique rule in which X appears on the left, Lemma 2 yields the interval
for the expansion of X. For example, for the SLP

X_7 → X_6 X_5
X_6 → X_5 X_4
X_5 → X_4 X_3
X_4 → X_3 X_2
X_3 → X_2 X_1
X_2 → a
X_1 → b

[Parse tree of X_7, with leaves a b a a b a b a a b a a b]



which generates abaababaabaab, we perform searches for a and b to find the intervals for the
(single-character) expansions of X_1 and X_2 and apply Lemma 2 to find, in turn, the intervals for the
expansions of X_3, ..., X_7. This is like working from the leaves to the root of the parse-tree (shown above
on the right), but we note we need apply Lemma 2 only once for each distinct non-terminal, rather 
than for every node of the tree. 
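The bottom-up evaluation can be sketched in Python as below. This is an illustration under our own assumptions: the index queries are simulated naively, the SLP is given as a dictionary mapping each non-terminal to a character or to a pair of non-terminals, and memoization stands in for processing each distinct non-terminal exactly once.

```python
def make_queries(text):
    """Naive stand-ins for the index queries on text + '$' (1-indexed)."""
    s = text + "$"
    starts = sorted(range(1, len(s) + 1), key=lambda p: s[p - 1:])
    row_of = {p: r for r, p in enumerate(starts, start=1)}
    locate = lambda r: starts[r - 1] - 1        # position in T of bwt[r]
    anti_locate = lambda i: row_of[i + 1]       # row of bwt holding T[i]

    def interval(pattern):
        rows = [r for r, p in enumerate(starts, start=1)
                if s.startswith(pattern, p - 1)]
        return (rows[0], rows[-1]) if rows else None

    return locate, anti_locate, interval


def first_row(lo, hi, pred):
    """Smallest r in [lo, hi) with pred(r) true, else hi (pred monotone)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def concat_interval(locate, anti_locate, iv1, len1, iv2):
    """Interval of P1P2 from the intervals of P1 and P2, or None."""
    key = lambda r: anti_locate(locate(r) + len1)
    first = first_row(iv1[0], iv1[1] + 1, lambda r: key(r) >= iv2[0])
    last = first_row(first, iv1[1] + 1, lambda r: key(r) > iv2[1]) - 1
    return (first, last) if first <= last else None


def slp_interval(text, rules, root):
    """Find the interval for the expansion of root, performing one
    concatenation per distinct non-terminal."""
    locate, anti_locate, interval = make_queries(text)
    memo = {}                   # non-terminal -> (interval or None, length)

    def solve(x):
        if x not in memo:
            rhs = rules[x]
            if isinstance(rhs, str):
                memo[x] = (interval(rhs), 1)
            else:
                (iv1, len1), (iv2, len2) = solve(rhs[0]), solve(rhs[1])
                iv = None
                if iv1 is not None and iv2 is not None:
                    iv = concat_interval(locate, anti_locate, iv1, len1, iv2)
                memo[x] = (iv, len1 + len2)
        return memo[x]

    return solve(root)[0]


FIB_RULES = {"X1": "b", "X2": "a", "X3": ("X2", "X1"), "X4": ("X3", "X2"),
             "X5": ("X4", "X3"), "X6": ("X5", "X4"), "X7": ("X6", "X5")}
```

For the example SLP, slp_interval performs five concatenations (one each for X_3, ..., X_7) even though the parse tree has twelve internal nodes.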

Rytter [14] gave an algorithm for building an SLP with nearly minimum size. He first proved 
that the number of phrases in the LZ77 parse of P, even without allowing overlaps, is a lower 
bound on the size of any SLP. He then showed how to convert that parse into an SLP. To do this, 
he defined an AVL-grammar for P to be a context-free grammar in Chomsky normal form that
generates P and only P, such that the parse-tree has the shape of an AVL-tree. Notice this means
the parse-tree has height O(log m), a fact we will use in Section 4.

Suppose the first i phrases of the LZ77 parse encode P[1..j], we have already built an AVL-grammar
for P[1..j] and the (i + 1)st phrase is (b, ℓ). Then we can build the AVL-grammar for
P[1..j + ℓ] by splitting the parse-tree (as an AVL-tree) for P[1..j] between its (b - 1)st and bth
leaves and between its (b + ℓ - 1)st and (b + ℓ)th leaves, so as to obtain an AVL-grammar for
P[b..b + ℓ - 1], then joining that to the right side of the AVL-grammar for P[1..j]. Rytter showed
how to do this in O(log j) time while adding O(log j) new non-terminals. If we repeat this procedure
for each phrase, then in O(g log m) time we obtain an SLP with O(g log m) non-terminals, where
g is the size of the smallest SLP.

If we apply LZ77 parsing to a sequence of patterns P_1, ..., P_t one by one, while allowing matches
to cross the boundaries of previously seen patterns, then we produce at most t more phrases than if
we processed the concatenation of the patterns P_1 ... P_t as a single string. To see why, consider the
parse for the concatenation P_1 ... P_t. If any phrase crosses the boundary between patterns, then we
break the phrase at the boundary. This increases the number of phrases by at most t. Therefore,
since LZ77's greedy parsing is optimal, it follows that parsing P_1, ..., P_t separately produces at
most t more phrases than parsing them concatenated. Consequently, Rytter's algorithm on such a
parse produces an AVL-grammar with size O((g + t) log m). Once we have such a grammar for
P_1 ... P_t, we can perform t - 1 splits in O(t log m) time to obtain grammars for each of P_1, ..., P_t,
adding O(t log m) new non-terminals. Applying Lemma 2 to each non-terminal in these separate
AVL-grammars, we obtain the following theorem.

Theorem 2. We can build an (nH_k(T) + o(n(H_k(T) + 1)))-bit index for a text T[1..n] such that,
if we are given Rytter's AVL-grammar for the concatenation of t patterns P_1, ..., P_t of total length
m, parsed one by one, then we can search for P_1, ..., P_t in O((g + t) log^2 m log^{1+ε} n) time, where
g is the size of the smallest SLP for the concatenation P_1 ... P_t.

Of course, since computing the LZ77 parse takes linear time [6], Theorem 2 does not help us much
when searching in only one index: we can achieve at best a factor of log log σ speed-up over
searching directly. If we compute Rytter's AVL-grammar on one machine and then send the resulting SLP
to q machines each storing part of a distributed index, however, then we use O(m + (g + t) log m)
preprocessing time on the first machine but then only O((g + t) log^2 m log^{1+ε} n) time on each
machine, rather than O(m log log σ) on each machine. This is our main result for this paper, although
we feel Theorem 2 and the techniques behind it are likely to prove of independent interest.

Corollary 1. Given texts T_1, ..., T_q to be stored on q machines, we can build indexes of size
|T_1|H_k(T_1) + o(|T_1|(H_k(T_1) + 1)), ..., |T_q|H_k(T_q) + o(|T_q|(H_k(T_q) + 1)) bits such that, given t
patterns P_1, ..., P_t of total length m, we can spend O(m + (g + t) log m) time preprocessing P_1, ..., P_t
on one machine, where g is the size of the smallest SLP for the concatenation P_1 ... P_t, then
O((g + t) log^2 m log^{1+ε} n) time searching on each machine.

In other words, when the patterns are compressible together (e.g., when they have small edit 
distance, or when most can be formed by cutting and pasting parts of the others) then we can 
greatly reduce the total amount of processing needed to search in a distributed index.

4 Parallel Searching



Several authors (see [7] and references therein) have already studied parallelization of LZ77 or its 
variants, so in this section we focus on parallelizing the actual searches. Suppose we are searching for 
a pattern P[1..m] in a text T[1..n] on a machine with p processors. Russo, Navarro and Oliveira [12]
gave an (nH_k(T) + o(n log σ))-bit index for T that we can search in

O(m/p + log n log log n (log p + log log n log log p))

time. For reasonably long patterns and not too many processors, their speedup is linear in p.
We now show that if the smallest SLP for P has size g and we already have the AVL-grammar
for P that results from Rytter's algorithm, then we can search our index from Lemma 2 in
O(⌈g/p⌉ log^2 m log^{1+ε} n) time. That is, we achieve an unconditionally linear speedup over
Theorem 2 and a better upper bound than Russo et al. when P is very compressible.

Each non-terminal in the AVL-grammar can appear only at one specific height in the parse tree
for P, and that height is O(log m). We sort the non-terminals into non-decreasing order by height
and process them in that order. Notice that the concatenation we perform for each non-terminal
cannot depend on the concatenation for any non-terminal of equal or greater height. Therefore, we
can parallelize the concatenations for non-terminals of the same height. For each height with r
non-terminals, we use O(⌈r/p⌉ log m log^{1+ε} n) time. Since there are O(g log m) non-terminals in
total and O(log m) possible heights, calculation shows we use O(⌈g/p⌉ log^2 m log^{1+ε} n) time.
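The height-by-height schedule can be sketched in Python as follows. This is an illustration only: it computes which non-terminals could be processed together rather than running anything in parallel, the grammar is given in our own dictionary format (a character or a pair of non-terminals per rule), and the names heights and schedule are hypothetical.

```python
def heights(rules):
    """Height of each non-terminal: 0 for a single character, otherwise
    one more than the larger height of its two children."""
    h = {}

    def height(x):
        if x not in h:
            rhs = rules[x]
            if isinstance(rhs, str):
                h[x] = 0
            else:
                h[x] = 1 + max(height(rhs[0]), height(rhs[1]))
        return h[x]

    for x in rules:
        height(x)
    return h


def schedule(rules, p):
    """Return a list of rounds; each round holds at most p non-terminals
    of equal height, and lower heights always come first."""
    levels = {}
    for x, hx in heights(rules).items():
        levels.setdefault(hx, []).append(x)
    rounds = []
    for hx in sorted(levels):
        batch = sorted(levels[hx])
        for k in range(0, len(batch), p):   # ceil(r / p) rounds per height
            rounds.append(batch[k:k + p])
    return rounds


if __name__ == "__main__":
    rules = {"X1": "b", "X2": "a", "X3": ("X2", "X1"), "X4": ("X3", "X2"),
             "X5": ("X4", "X3"), "X6": ("X5", "X4"), "X7": ("X6", "X5")}
    print(len(schedule(rules, 2)))   # 6
```

If height h has r_h non-terminals, the number of rounds is the sum of ⌈r_h/p⌉ over all heights, which is at most (total number of non-terminals)/p plus the number of heights; with O(g log m) non-terminals and O(log m) heights this gives the O(⌈g/p⌉ log m) rounds behind the bound above.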

Theorem 3. We can build an (nH_k(T) + o(n(H_k(T) + 1)))-bit index for a text T[1..n] such that, on
a machine with p processors and given Rytter's AVL-grammar for a pattern P[1..m] whose smallest
SLP has size g, we can search for P in O(⌈g/p⌉ log^2 m log^{1+ε} n) time.



5 Other Applications 

There are several other possible applications for the techniques we have developed in this paper. For 
example, we might be given patterns with wildcards to preprocess, where the characters that will 
replace those wildcards will be given to us later. Such a pattern could be a fragment of DNA with 
wildcards in the locations of single-nucleotide polymorphisms, which we can use as a reusable
template: we search in advance for the maximal sub-patterns not containing any wildcards so that 
later, given an assignment of characters to the wildcards, we can quickly find all occurrences of the 
filled-in pattern. We can even allow the wildcards to represent blanks of unknown length, so the 
template could be a part of the document with missing characters, words or phrases. 

In Section 3 we preprocess P_1, ..., P_t first, send the resulting grammars to each machine, then
use those grammars and Lemma 2 to speed up searches in each part of the distributed index. Since 
applying Lemma 2 on a specific machine yields an interval in the part of the distributed index stored 
on that machine, in this context it makes little sense to mix the preprocessing and the searching. 
In general, however, we can apply Lemma 2 to each non-terminal as it is created; we can also split 
off the SLP for each pattern when we have finished parsing that pattern. This could be useful if, for 
example, we want to search for the patterns as they are given to us, instead of batching them, and 
we are given the LZ77 parse instead of having to compute it ourselves. More generally, we could be 
given any set of instructions on how to form each new pattern by cutting and pasting parts of the 
patterns we have already seen.

We can extend this idea to consider maintaining dynamic libraries of sub-patterns: we keep
an AVL-grammar for each sub-pattern, with the interval for each non-terminal stored; given in- 
structions on how to assemble a new pattern by cutting and pasting pieces of the sub-patterns, 
we (non-destructively) split the AVL-grammars for those sub-patterns and form the new pattern, 
simultaneously computing the interval for the new pattern. We leave as future work exploring these 
and other possible applications. We are currently investigating whether our results can be used to 
speed up approximate pattern matching in compressed indexes (see [13] for a recent discussion of 
this topic). Notice each pair of strings within a small edit distance of a pattern share long sub- 
strings; we can use our results and heuristics to explore adaptively the neighborhood around the 
pattern, pruning branches of our search once we know they cannot yield a match. 

In some of these applications, of course, Barbay et al.'s index may not be the best choice. 
We chose it for this paper because it has the smallest space bound but, when time is more
important than space, we could use, e.g., Sadakane's Compressed Suffix Array [15]: this index takes
[...] nH_0(s) + 2n log(H_0(s) + 1) + 3n + o(n) bits, where δ_1 and δ_2 < 1 are arbitrary positive
constants, and supports both locate and (as we will show in the full version of this paper) anti-locate in
O(log^{δ_2} n/(δ_1 δ_2)) time without any additional data structures, rather than the O(log^{1+ε} n) time
we used in this paper.

6 Conclusions 

We have shown how, if we have already found the intervals in the BWT corresponding to two 
patterns, then in polylogarithmic time we can find the interval corresponding to their concatenation. 
Combining this with a result by Rytter on constructing small grammars to encode strings, we have 
given a method for preprocessing a sequence of patterns such that they can be sought quickly in 
a BWT-based index. Although the preprocessing is not much faster than searching for the patterns
directly, it could be useful when we wish to search in a distributed index: we preprocess the patterns 
on one machine, then send them to all the machines, so that the cost of preprocessing is paid only 
once but the benefit is reaped for each machine. 

We have also shown how the same or similar techniques can be applied to matching a pattern 
with wildcards in an index, obtaining a slower but more space-efficient alternative to a theorem by 
Lam et al. [8]; and to parallel pattern matching, showing how, given a small SLP for a pattern, 
we can parallelize our faster search. We believe the techniques we have developed will prove of 
independent interest.